ABSTRACT

Genome-wide association studies (GWAS) have 
become a preferred method to identify new 
genetic susceptibility loci. This technique aims to 
understand the molecular etiology of common 
diseases, but in many cases, it has led to the 
identification of loci with no obvious biological 
relevance. Herein, we show that previously unrec- 
ognized sequence homologies have caused 
single-nucleotide polymorphism (SNP) microarrays 
to incorrectly associate a phenotype with a given 
locus when in fact the linkage is to another distant 
locus. Using genetic differences between male and 
female subjects as a model to study the effect of 
one specific genomic region on the whole SNP 
microarray, we provide strong evidence that the 
use of standard methods for GWAS can be mislead- 
ing. We suggest a new systematic quality control 
step in the biological interpretation of previous and 
future GWAS. 

INTRODUCTION 

Genome-wide association studies (GWAS) use micro- 
arrays of oligonucleotide probes to identify associations 
between single-nucleotide polymorphisms (SNPs) and a 
given phenotype. DNA is digested by restriction 
enzymes into restriction fragments of hundreds of bases, 
marked with fluorescent bases and hybridized on micro- 
arrays containing millions of oligonucleotidic probes that 
are complementary to the SNP's flanking sequences. 
When a given variant of a SNP is present, the restriction 
fragment containing it will hybridize on the corresponding 
probe through the complementarity of the probe and 
the SNP's flanking sequence, and the variant will be 
detectable by its fluorescence signal. In theory, a sequence 
variant with an effect on the phenotype should be located 
in the region surrounding the identified SNPs. Currently, 
the interpretation of GWAS is focused on the exploration 
of these regions (1). In some cases, this strategy has 
allowed for the discovery of the underlying molecular 
mechanism of a phenotype or disease (2). However, 
many SNPs identified to date have not provided physio- 
logical insights (1,3). Because these SNPs have been 
identified with a high level of statistical significance and 
often have been validated by independent replication 
studies (1), we believe that they correspond to true differences 
in the DNA samples used for analysis, and we sought other 
explanations for a statistical link between a SNP and a 
phenotype. 

One possible explanation is that we cannot yet comprehend 
the biological function of the variants we detect. In a 
recent study, the genetic variations causing the association 
of a locus with a chronic renal disease were discovered only 
years after the locus was identified by GWAS (4). 

Another and more troublesome possibility is that the 
SNP microarray technique used for GWAS systematically 
associates a phenotype with an irrelevant locus, distant 
from the genetic sequence(s) responsible for the pheno- 
type. This would mean that variations in DNA, 
although spatially unrelated to the SNP, can alter its cor- 
responding signal on a microarray. 

Genetic differences between sexes (i.e. the presence of a 
Y or a second X chromosome) present the possibility of an 
experimental design to investigate the effect of a defined 
chromosome on the whole SNP microarray results, 
including results concerning autosomes that 'should not' 
be altered by differences of sex. Therefore, we performed a 
GWAS on control subjects from available data sets, 
searching for autosomal SNPs associated with sex status, 
which would not be found if the probes on the array were 
truly specific. 
MATERIALS AND METHODS

Data sets 

Five different data sets from previous publications were 
used. Data set 1 was obtained from 161 control subjects 
(45 male and 116 female subjects) using the Illumina Quad 
v3 370 k microarray (5). Data set 2 was obtained from 126 
control subjects (64 male and 62 female subjects) using the 
Affymetrix 500 k array (6). Data set 3 was obtained from 
the HapMap CEU phase 2 and included 90 subjects 
(44 male and 46 female subjects). Data set 4 was 
obtained from the HapMap CEU phase 3 and included 
165 subjects (80 male and 85 female subjects) (7). Data set 
5 was obtained from 100 control subjects (50 male and 50 
female subjects) using an Affymetrix 6.0 microarray (8). 

Statistical analysis 

Associations and correlations were considered statistically 
significant when the P-value was <10^-7. No filters were set 
for Hardy-Weinberg equilibrium, minor allele frequency, 
or no-call rate, as our study uses sex difference to study 
the effect of sex chromosome variations, which do not 
follow the same distribution as autosomes. For the 
analysis of Data sets 1 through 4, we performed an 
association test on sex using the PLINK whole genome 
association analysis toolset (Purcell, PLINK v1.07, 
http://pngu.mgh.harvard.edu/purcell/plink/) (9). For the 
analysis of Data set 5, we compared the probe intensities in 
men versus women using a two-sided t-test with the 
MultiTtest function of the ClassComparison package for 
the R software (Coombes, http://bioinformatics.mdanderson.org/OOMPA; 
R Development Core Team, http://www.R-project.org/). Next, we 
calculated the average intensity value for the most 
reproducible SNPs (intensity values of replicate probes of a 
SNP showing a correlation with r > 0.7 and P < 10^-9 by 
Pearson's linear correlation test) and analyzed the 
correlation between these 46 SNPs' average probe intensity 
ratios in female subjects using a Pearson's linear 
correlation test with the two-sided correlation test 
function of R. 
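The reproducibility filter described above can be sketched as follows. This is a minimal illustration in Python rather than the R code used in the study. The function names, the SNP labels and the toy intensity values are hypothetical, and only the r > 0.7 criterion is applied (the accompanying P < 10^-9 criterion is left out).

```python
import math


def pearson_r(x, y):
    """Pearson's linear correlation coefficient of two equal-length sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("a constant sequence has no defined correlation")
    return cov / math.sqrt(var_x * var_y)


def reproducible_snps(replicates, r_min=0.7):
    """Keep SNPs whose two replicate probes agree across subjects (r > r_min)."""
    return [
        snp
        for snp, (probe_1, probe_2) in replicates.items()
        if pearson_r(probe_1, probe_2) > r_min
    ]


# Hypothetical intensity ratios: two replicate probes per SNP, five subjects each.
toy_replicates = {
    "snp_a": ([0.10, 0.48, 0.52, 0.90, 0.95], [0.12, 0.50, 0.49, 0.88, 0.97]),
    "snp_b": ([0.10, 0.48, 0.52, 0.90, 0.95], [0.60, 0.20, 0.85, 0.30, 0.55]),
}

kept = reproducible_snps(toy_replicates)
print(kept)
```

In this toy input, the two probes of snp_a track each other closely and pass the filter, whereas those of snp_b do not.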

Sequence alignments 

The Basic Local Alignment Search Tool (BLAST) (10) 
was used to search for sequence alignments on the 
human genome in the NCBI build 37.2. For short se- 
quences (probes), the search parameters were set to 
default for the BlastN algorithm except a word size 
of 15, an expect threshold of 0.05, and no filter for low 
complexity and species-specific repeats. For larger se- 
quences (restriction fragments), the search parameters 
were set to default for the megablast algorithm except a 
word size of 20 and an expect threshold of 0.05. 

Identification of studies with false results 

We performed a literature-wide search for GWAS that 
identified the genes neighboring the SNPs. We searched 
PubMed and the GWAS Catalog (www.genome.gov/gwastudies, 
accessed 1 December 2011) for genes neighboring the 
gender-associated SNPs. We picked a few 
studies for which the precise data needed to check the 
hypothesis of a sex-related bias were available to us, and 
we analyzed the data in detail. 

RESULTS 

Sex modifies the results for autosomal SNPs in 
microarrays 

We performed a genome-wide association study on sex in 
four independent data sets. All data were from control 
subjects. The data were obtained using the following 
technologies: an Affymetrix 500 k microarray, an 
Illumina 370 k microarray, and HapMap CEU phase 2 
and phase 3 genotypes. In all four data sets, we found 
SNPs that were allegedly located on autosomes but that 
exhibited significantly different genotype frequencies in 
men and women. These results are detailed in Table 1. 
When the analysis was restricted to highly statistically 
significant SNPs (P < 10^-7), we were still able to identify six 
SNPs from the Affymetrix 500 k array and six SNPs from 
the Illumina 370 k array that were located on autosomes 
and associated with sex. The analysis of the HapMap data 
yielded 35 and 17 SNPs from the HapMap phase 2 and 
HapMap phase 3 genotypes, respectively. Interestingly, 
one locus was associated with sex in all the data sets 
(near the TPTE2 gene), and four other loci were found 
in at least two data sets (near the WWC2/CDKN2AIP, 
ADAMTSL3/UBE2QP1, PPP1R12B and PTGER4 
genes). 
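As an illustration of how an allele-frequency difference between sexes translates into an association statistic, the sketch below runs a 1-df allelic chi-square test, assuming this is the form of association test that was applied. The allele counts are not given in the article: they are inferred from the minor allele frequencies reported in Figure 1 for rs12372818 (0.02174 in females, 0.5 in males), under the assumption that all 46 female and 44 male HapMap phase 2 subjects were genotyped. The function name is hypothetical.

```python
import math


def allelic_chi2(minor_f, major_f, minor_m, major_m):
    """1-df chi-square on a 2x2 table of allele counts (female row, male row)."""
    table = [[minor_f, major_f], [minor_m, major_m]]
    total = sum(sum(row) for row in table)
    row_totals = [sum(row) for row in table]
    col_totals = [table[0][j] + table[1][j] for j in range(2)]
    chi2 = 0.0
    for i in range(2):
        for j in range(2):
            expected = row_totals[i] * col_totals[j] / total
            chi2 += (table[i][j] - expected) ** 2 / expected
    # Survival function of a chi-square variable with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value


# Inferred counts (assumption): 2 of 92 female alleles and 44 of 88 male
# alleles carry the minor allele, matching frequencies 0.02174 and 0.5.
chi2, p = allelic_chi2(2, 90, 44, 44)
print(f"chi2 = {chi2:.2f}, P = {p:.2e}")
```

With these inferred counts the P-value falls far below the 10^-7 threshold used in the study, even though the SNP is annotated as autosomal.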

Replicated sequences in autosomes and sex chromosomes 
explain the effect of sex on autosomal SNPs 

Because Mendelian principles of allelic transmission do 
not explain the association of autosomal loci with sex, 
we investigated whether nucleotide sequences on sex 
chromosomes could hybridize to the oligonucleotide 
probes of autosomal SNPs in various microarrays. 
Analysis of 28 of the SNP-flanking sequences (i.e. one 
for each autosomal locus we had found associated with 
sex in the first step) using BLAST revealed that 21 of 
the 28 probes shared total or partial homology with 
sequences on the Y or X chromosome. All alignments of the 
SNP-flanking sequences and their locations on the genome 
can be found in Supplementary Data sets S1 and S2. 
Figure 1 shows the sequence alignment of a representative 
SNP-flanking sequence with an autosomal target sequence 
and with the homolog on a sex chromosome. We picked 
28 random SNPs among those that were not found to be 
associated with sex and used them as controls. The BLAST 
alignment showed that 26 of 28 SNPs had flanking 
sequences fully specific to their theoretical location, one 
had one homology on another autosome, and only 1 of 
28 had many weak homologies on other chromosomes 
including chromosome X (Supplementary Data set S3). 
We then aligned all probes' flanking sequences from 
Data sets 1 and 2 on the chromosome X and Y 
sequences, and the association of autosomal SNPs with 
sex versus homologies on sex chromosomes is represented 
in Figure 2 and Supplementary Figure S1. When 
comparing Chi square statistics of probes with homologies 
versus probes with no homologies on sex chromosomes, 
Table 1. SNPs with genotypes significantly associated with gender according to various platforms 



rsID | Chromosome | Position | Genes | SNP | Odds ratio (female/male) | P-value

Affymetrix 500 k, Illumina 370 k, HapMap CEU v2: [rows illegible in the extracted text]

HapMap CEU v3 [first seven rows illegible in the extracted text]

rs3994533 | 15 | 82882831 | ADAMTSL3, UBE2Q2P1 | T/C | 0 | 9.839 x 10^-26
rs2880301 | 13 | 18998534 | TPTE2 | T/C | 0 | 1.359 x 10^-25
rs12741415 | 1 | 200741397 | PPP1R12B | A/G | 0 | 1.763 x 10^-25
rs6944297 | 7 | 63937080 | ZNF138, LOC168474 | T/G | 0 | 2.458 x 10^-25
rs6836144 | 4 | 119595470 | LOC100128177, LOC100420037 | A/C | ∞ | 5.355 x 10^-25
rs1556557 | 1 | 241046639 | RSL24D1P4, LOC100129949 | A/G | 0.006211 | 1.77 x 10^-24
rs7039117 | 9 | 97097001 | FANCC | C/T | 0.006617 | 2.553 x 10^-23
rs6917603 | 6 | 30125050 | ETF1P1, C6orf12 | C/T | 0.0559 | 6.801 x 10^-20
rs9636470 | 2 | 87947576 | LOC730268, LOC100419917 | G/A | 3.569 | 3.869 x 10^-08
rs11635160 | 15 | 82607789 | ADAMTSL3, UBE2Q2P1 | A/G | 0.2805 | 7.955 x 10^-08


In bold are the SNPs that also were identified in an Affymetrix 6.0 data set by directly comparing probe intensities. 
BLAST results of rs12372818 flanking sequence (near HTR2A gene) 

rs12372818, chromosome 13:          TAATAATCATTGATTCCTGCTAGTCC [A/G] ATTAATTCCATGTCTGACTTCTGAA 
Homology sequence 1, chromosome Y:  TAATAATCATTGATTCCTGCCAGTCC   A   ATTAAATCCATGTCTGACTTCTGAA 
Homology sequence 2, chromosome Y:  TAATAATCATTGATTCCTGCCAGTCT   A   ATTAAATCCGTGTCTGACTTGTGAA 
Homology sequence 3, chromosome 3:  TAATAATCATTGATTCCTGCCACTCT   G   ATTAAATCCATATCTGACTTCTGAA 

Minor allele (A) frequency in HapMap v2: females 0.02174, males 0.5 
Figure 1. BLAST alignment analysis of the flanking sequence of a sex-associated SNP (rs12372818 on chromosome 13). Two homologous sequences 
are present on the Y chromosome (and one on chromosome 3). The presence of the 'A' variant on chromosome Y is responsible for a higher 
frequency of the minor allele in males. 



we found that some groups of probes with high 
homologies on sex chromosomes had a significantly 
higher association to sex. Interestingly, in Data set 2 
(Supplementary Figure S1), this was still true after 
exclusion of all SNPs showing an association to sex after 
Bonferroni correction. 
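The degree of homology shown in Figure 1 can be quantified with a simple mismatch count. The sketch below is not the BLAST procedure used in the study; it only compares, position by position, the rs12372818 flanking sequences with the three homologous sequences transcribed from Figure 1. The dictionary keys and function name are hypothetical labels.

```python
def mismatches(a, b):
    """Number of positions at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    return sum(x != y for x, y in zip(a, b))


# Left and right flanks of rs12372818 (chromosome 13), as printed in Figure 1.
PROBE_LEFT = "TAATAATCATTGATTCCTGCTAGTCC"
PROBE_RIGHT = "ATTAATTCCATGTCTGACTTCTGAA"

# (left flank, base at the SNP position, right flank) for each homolog.
HOMOLOGS = {
    "chrY_1": ("TAATAATCATTGATTCCTGCCAGTCC", "A", "ATTAAATCCATGTCTGACTTCTGAA"),
    "chrY_2": ("TAATAATCATTGATTCCTGCCAGTCT", "A", "ATTAAATCCGTGTCTGACTTGTGAA"),
    "chr3": ("TAATAATCATTGATTCCTGCCACTCT", "G", "ATTAAATCCATATCTGACTTCTGAA"),
}

flank_mismatches = {
    name: mismatches(PROBE_LEFT, left) + mismatches(PROBE_RIGHT, right)
    for name, (left, _base, right) in HOMOLOGS.items()
}

for name, (_left, base, _right) in HOMOLOGS.items():
    total_len = len(PROBE_LEFT) + len(PROBE_RIGHT)
    identity = 1 - flank_mismatches[name] / total_len
    print(f"{name}: {flank_mismatches[name]} mismatches, "
          f"identity {identity:.1%}, carries '{base}' at the SNP position")
```

Both Y-linked homologs carry the 'A' base at the SNP position, which is consistent with the higher minor allele frequency reported in males.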

The 7 other SNPs did not exhibit such strong homology 
between their flanking sequences and sex chromosome 
sequences. We therefore investigated the possibility that 
restriction fragments from the sex chromosomes compete with 
autosomal restriction fragments for hybridization to the 
oligonucleotide probe. 

We used BLAST to search the entire genome for se- 
quences exhibiting homology within the larger region cor- 
responding to the restriction fragment containing the 
SNP. We found that the restriction fragments from four 
of the seven SNPs associated with sex had homologous 
sequences located on the sex chromosomes. All align- 
ments of the SNP-flanking regions and their locations 
can be found in Supplementary Data sets S4 and S5. 
Supplementary Data set S6 shows the alignment of a 
representative SNP restriction fragment with sex chromosome DNA. 

Schematics of the two mechanisms that we have 
identified as possibly biasing SNP association results are 
presented in Figure 3. 

The study of probe intensities increases sensitivity in the 
search for SNP interference 

Because no strong sex chromosome homology was found 
for some SNPs, and because we found that homologies 
often were present on other autosomes, we suspected 
that weaker and/or repeated homologies might be suffi- 
cient to influence microarray results for some SNPs. 



To verify this hypothesis and because we were surprised 
that some loci were associated with sex in some data sets 
and not in others, we decided to use a fifth data set for a 
more refined analysis. 

This time, we analyzed the probe intensity values (rather 
than the genotype) in relation to the sex status on a fifth 
data set (Affymetrix 6.0). We found that 126 autosomal 
SNPs were significantly influenced by sex status 
(Supplementary Table S1). Remarkably, this intensity- 
based approach (studying a continuous variable) proved 
to be very powerful for detecting the influence of sex on 
these SNPs, as it allowed the identification of twice as 
many loci from only one data set as had been identified 
in four different data sets using the genotype-based 
approach. In addition, these results corroborated the 
results obtained by comparing genotypes (33 of 64 SNPs 
identified by the genotype-based approach were located in 
regions identified by the intensity-based approach). This 
confirms that most of the associations we observed were 
neither fortuitous nor specific to a single microarray tech- 
nology but were relevant to all SNP microarrays. 
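The intensity-based comparison can be sketched as a two-sample t statistic computed per probe. This is an illustrative Python version, not the R MultiTtest call used in the study. The pooled-variance (Student) form is an assumption, the function name is hypothetical, the intensity values are invented, and the P-value step is omitted.

```python
import math
import statistics


def pooled_t(group_a, group_b):
    """Student's two-sample t statistic (pooled variance) and its degrees of freedom."""
    n_a, n_b = len(group_a), len(group_b)
    if n_a < 2 or n_b < 2:
        raise ValueError("each group needs at least two observations")
    df = n_a + n_b - 2
    pooled_var = (
        (n_a - 1) * statistics.variance(group_a)
        + (n_b - 1) * statistics.variance(group_b)
    ) / df
    std_err = math.sqrt(pooled_var * (1 / n_a + 1 / n_b))
    t = (statistics.mean(group_a) - statistics.mean(group_b)) / std_err
    return t, df


# Hypothetical intensities of one autosomal probe in men and in women.
men = [1.42, 1.51, 1.38, 1.47, 1.55]
women = [1.02, 0.97, 1.08, 1.01, 0.95]

t_stat, dof = pooled_t(men, women)
print(f"t = {t_stat:.2f} on {dof} degrees of freedom")
```

Because the intensity is a continuous variable, a shift between sexes can be detected even when it is too small to change the genotype call, which is the advantage the text attributes to this approach.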

However, one SNP near PTGER4 was found to be 
strongly associated with sex in the Illumina data set 
(P = 1.53 x 10^-30), whereas none of the 225 SNPs 
neighboring PTGER4 in the Affymetrix 6.0 data set 
approached significance (Supplementary Table S2), indicating 
that the association with sex status was restricted to a 
single SNP, not the entire locus. 

A literature-wide search for these loci allowed the detection 
and correction of errors caused by a sex-related bias 

We searched for published genotyping studies that had 
reported the identification of loci containing sex- 
dependent SNPs. We found that the post hoc verification 
Figure 2. Sex-association score (Chi square statistics) versus the homology score (BLAST raw score) in data set 1 (Illumina 370 k). For each level of 
homology, diamonds show the mean with its 95% confidence interval. **P < 0.0001 versus no homology (0). 



[Figure 3 panel labels: sex chromosome restriction fragment; autosomal restriction fragment; probes of SNPs from the X chromosome; probes of a SNP from an autosome] 

Figure 3. Schematic view of the hybridization of DNA to a microarray 
probe. Three possibilities include theoretical hybridization, rogue hy- 
bridization with a homolog, and bulk hybridization of genomic DNA 
that sequesters the restriction fragment away from the probe. (1) 
Hybridization of the target sequence with the probe, according to 
theory. (2) Hybridization of a sex chromosome sequence with the 
probe of a homologous autosomal SNP, competing with the theoretical 
autosomal restriction fragment. (3) Hybridization of a sex chromosome 
restriction fragment with an autosomal SNP restriction fragment, 
competing with the microarrays' oligonucleotide probe. (4) 
Oligonucleotide probes for sex chromosomes' SNPs hybridize with 
the same restriction fragment as probes for autosomal SNPs and are 
thus statistically correlated. 



of genotyping analysis often was impossible (because of 
difficulties in getting access to the raw data). Another dif- 
ficulty stems from the frequent use of imputation to create 
virtual SNPs from other nearby SNPs, e.g. to merge data 
from various microarray platforms. This means that a 
SNP with a flanking sequence duplicated in the genome 
can be imputed to a virtual SNP with a unique flanking 
sequence. However, we were able to select three studies in 
which cryptic duplications of SNP flanking sequences on a 
sex chromosome have led to the publication of erroneous 
results: 

Study 1. A PPP1R12B allele was found to be preferentially 
transmitted from parents to offspring, a phenomenon 
that the authors call 'transmission distortion' (11). 
They note that a SNP near PPP1R12B has 
an unexpectedly high frequency of heterozygotes when it is 
transmitted from male parents. In the light of our data 
showing that PPP1R12B lies in a region duplicated on 
the Y chromosome, causing the SNPs near PPP1R12B to 
be biased by sex, this observation is more likely explained 
by the transmission of a 'third PPP1R12B allele' on the Y 
chromosome from father to son. 

Study 2. A SNP near TPTE2 was found to be associated 
with the presence of hepatocarcinoma in patients with 
liver cirrhosis (8). This association was in fact due to a 
homology of the TPTE2 region with the Y chromosome, combined 
with a sex ratio of 3.4 in hepatocarcinoma versus 1.3 in liver 
cirrhosis (12). The authors removed TPTE2 from their 
results after our letter (13). 

Study 3. A locus near PTGER4 was found to be 
associated with multiple sclerosis in a meta-analysis (14). 
We found a SNP near PTGER4 to be associated 
with sex in Illumina (because of a sequence homology on 
the Y chromosome) but not in Affymetrix microarrays (cf. 
text above, Table 1 and Supplementary Data set S1). This 
meta-analysis used Illumina data in 37% of the 2624 cases 
and only 12% of the 7220 controls, indicating that the 
proportion of males tested on the Illumina platform (i.e. 
subject to the sex-related bias toward PTGER4) was 
larger in cases than in controls (10 versus 3%). The 
virtual SNP rs6896969 (obtained by imputation after 
merging Affymetrix and Illumina data) near PTGER4 
was associated with the disease (without correction for sex 
and cohort of origin) with P = 10^-7. Because of a strong 
preponderance of women, the authors performed 
a second analysis using sex and cohort of origin as 
covariates to identify additional susceptibility loci for the 
disease. They report in the Supplementary Data, without 
comment, that the association of rs6896969 with the 
disease then loses genome-wide significance (P = 10^-2). By 
joining their analysis (unadjusted for sex and cohort of 
origin) with the replication analysis, they find a significant 
association of PTGER4 with the disease, a result they 
highlight in their conclusion. We think this association 
is due to the bias we describe here and has no 
pathophysiological significance. 

Replicated sequences modify the results of SNPs 
regardless of sex 

We next asked whether replication of an autosomal 
sequence containing a SNP on another autosome could 
influence the microarray result concerning that SNP. 
This is especially important as the filters usually used on 
the data set (e.g. Hardy-Weinberg equilibrium, no-call 
rate) would be less likely to eliminate SNPs with 
interautosomal homologies than SNPs with homologies 
on sex chromosomes. As these replicated sequences 
could be present anywhere in the genome, it would take 
~5 x 10^11 correlation calculations to investigate all 
possible combinations. This analysis would require both 
more computational power than we have and more 
patients to achieve the statistical power required to take 
the necessary multiple test correction into account (15). 
Instead, we chose to study the correlation of autosomal 
SNPs that were associated with sex in the first step of our 
study. We chose to study these SNPs in women, as they 
have two X chromosomes and, thus, a SNP distribution 
similar to autosomes. This provided us with a model of 
interautosomal SNP correlation in which we had only a 
handful of SNPs to test, preselected for their high 
probability of being influenced by chromosome X. We looked 
for significant correlations between a selection of 46 
autosomal SNPs we had found to be influenced by gender 
status and any of the SNPs located on the X chromosome. 
We found that 31 autosomal SNPs (67%) were significantly 
correlated with at least one (and up to 10^3) SNPs 
on the X chromosome (Supplementary Data set S7). We 
used BLAST to search for alignment of the X chromo- 
some with either the sequences of the autosomal SNPs' 
probes or the SNPs' restriction fragments. Thus, we 
identified repetitive homologies in loci from the X 
chromosome in locations where we had found SNPs 
with significant correlations with autosomal SNPs. 
Interestingly, we found that some SNPs could be signifi- 
cantly correlated even when only short homologous se- 
quences were involved (Table 2). 
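The scale of the all-pairs analysis mentioned above can be checked with a short calculation. The article does not state the number of SNPs behind its estimate; the sketch assumes an array of about one million SNPs, and it uses a Bonferroni correction at a nominal 0.05 level purely as an example of a multiple-test correction. Both values are assumptions.

```python
import math

# Assumption: an array carrying about one million SNPs.
n_snps = 10**6

# Every unordered pair of SNPs requires one correlation test.
n_pairs = math.comb(n_snps, 2)

# Example correction (assumption): Bonferroni at a nominal level of 0.05.
per_test_threshold = 0.05 / n_pairs

print(f"pairwise tests: {n_pairs:.3e}")
print(f"per-test significance threshold: {per_test_threshold:.1e}")
```

Under these assumptions the number of tests is of the order of 5 x 10^11, and each individual correlation would need a P-value of roughly 10^-13 to survive correction, which illustrates why the analysis was restricted to a preselected set of 46 SNPs.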



DISCUSSION 

Since the first GWAS using SNP microarrays, the validity 
of discoveries made with this method has been the subject of 
intense debate (2,16). Although such an unbiased, systematic 
genome-wide approach is very appealing, technically, 
this approach amounts to looking for a needle in a 
haystack without knowing what the needle looks like. 
Here, we found that SNP microarrays, although they 
varied in design and were performed on different individuals, 
yield reproducible information that corresponds to 
true biological properties. Our independent association 
studies on gender repeatedly highlighted SNPs related to 
the sex chromosomes by sequence homology. However, 
our findings also demonstrate that, to date, technical 
flaws pertaining to SNP microarrays have occurred, af- 
fecting the information that they are designed to 
retrieve. Although it can be expected that the association 
of a SNP with a given phenotype will reflect a molecular 
mechanism involving the single genomic region surround- 
ing that very SNP, it actually integrates many interactions 
between more-or-less homologous sequences also subject 
to variations but without any relevance with respect to the 
studied locus. Our results show that, in four separate data 
sets obtained from various genotyping platforms, some 
SNPs systematically give spurious results. Although 
these homologous sequences are easily detectable when 
they are located on sex chromosomes, they are not 
systematically eliminated, which leaves open the possibility of 
misleading findings. We have verified that our results have 
practical applications in GWAS, showing that this bias 
has led the authors of these studies to identify statistical 
associations of SNPs with a phenotype with no underlying 
biological relevance (8,11,14). 

We also demonstrate that homologies between two 
autosomal regions cause errors that may be both more 
frequent and more cryptic. Thus, extending our findings 
on sex chromosomes to the whole genome should detect 
other yet unrecognized homologies. Overall, the conditions 
of our analysis, which was performed on a limited 
number of subjects and investigated only effects due to sex 
chromosomes, suggest that the actual number of 
SNPs that confer a bias and jeopardize the interpretation 
of the results might be much greater. 

The presence of artifactual results in microarrays has 
been predicted in previous publications. In Musumeci's 
Table 2. Example on SNP rs13269433 of convergent approaches using BLAST sequence alignment to identify interautosomal SNP homologies 
and a correlation test to identify interdependent SNPs 

rs13269433, chromosome 8, near MFHAS1 

Flanking sequence: ATATATATCAGCCAGA[T/C]GTGCCACGTGAGCCTG 

BLAST hits | Position on chromosome X | rsID | Correlation (r) 

From left to right, for each line, the gene nearest to the BLAST hit (region of homology to rs13269433 on chromosome X), the aligned sequence, its 
position on chromosome X, the correlated SNPs in the same region, its correlation factor r and its P-value. 



in silico study, the presence of duplicated sequences with a 
single nucleotide difference was estimated to represent 
8.3% of all SNPs from the dbSNP database (17). The 
authors argued that these duplicated sequences polluted 
microarrays with SNPs that could never be associated 
with the studied phenotype. In a more recent study, 
Doron and Shweiki (18) showed that 11.9% of HapMap 
SNPs align to the genome non-uniquely (30 nt 
upstream and downstream of the SNP position). They 
suggested that the SNP uniqueness problem is a potentially 
massive bias in genotyping analysis. Here, we show that 
replicated sequences can indeed be responsible for the 
identification of false-positive associations. Furthermore, we 
show that even weak homologies can modify the microarray 
results. These data corroborate the experimental 
results of Eklund, who showed that even weak similarities 
are sufficient to bias microarray probes when 10% of 
hemoglobin cDNA is added to the chip (19). Our 
study takes a genome-wide approach and focuses on 
microarray data. However, the bias we discovered 
is not specific to microarray technology and 
could occur in other types of genotyping studies. 
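
The uniqueness of a SNP context can be checked in principle as sketched below. This is a deliberately naive illustration, not the alignment procedure used in the cited studies or in ours: it scans a toy reference string for the flanking sequences around each allele, on both strands, allowing a chosen number of substitutions and no indels. The function names and the toy reference are hypothetical; only the flanking sequence is taken from Table 2. A real analysis would use a dedicated aligner such as BLAST against the full genome.

```python
def reverse_complement(seq):
    """Reverse complement of a DNA string (A, C, G, T, N)."""
    comp = {"A": "T", "C": "G", "G": "C", "T": "A", "N": "N"}
    return "".join(comp[base] for base in reversed(seq.upper()))

def count_hits(reference, query, max_mismatches=0):
    """Start positions where query matches reference with at most
    max_mismatches substitutions (indels are not considered)."""
    reference, query = reference.upper(), query.upper()
    hits = []
    for start in range(len(reference) - len(query) + 1):
        mismatches = 0
        for a, b in zip(reference[start:start + len(query)], query):
            if a != b:
                mismatches += 1
                if mismatches > max_mismatches:
                    break
        else:  # inner loop finished without exceeding the mismatch budget
            hits.append(start)
    return hits

def is_unique(reference, left_flank, alleles, right_flank, max_mismatches=0):
    """True if flank + allele + flank maps to at most one reference locus,
    pooling hits over all alleles and both strands."""
    positions = set()
    for allele in alleles:
        query = left_flank + allele + right_flank
        positions.update(count_hits(reference, query, max_mismatches))
        positions.update(count_hits(reference, reverse_complement(query), max_mismatches))
    return len(positions) <= 1

# Flanking sequence from Table 2; the reference strings are toy constructs.
left, right = "ATATATATCAGCCAGA", "GTGCCACGTGAGCCTG"
toy_unique = "GGGG" + left + "T" + right + "CCCC"
toy_dup = toy_unique + reverse_complement(left + "C" + right) + "TTTT"
print(is_unique(toy_unique, left, "TC", right))  # True
print(is_unique(toy_dup, left, "TC", right))     # False
```

In the second toy reference, the context of the C allele is present on the opposite strand at a second location, so the SNP is flagged as non-unique.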

The genome is known to be rich in repetitive sequences 
(20). Most of these sequences are considered to be 'junk 
DNA' because they have no functional promoter regions 
and are not expressed. Sex chromosomes contain large 
amounts of these repetitive sequences (21-23). The telomeric 
regions are particularly rich in repetitive sequences 
and are especially prone to neomutation (24). Thus, 
special attention should be paid to these repetitive sequences 
when studying a pathological trait (24), and duplications 
should be taken into consideration when 
interpreting GWAS. 

Genotype studies can reveal statistically genuine 
associations between markers and traits that are nevertheless 
indirect and cannot be resolved by increasing sample sizes (25). 
Our study shows that cryptic sequence duplication can 
cause such indirect associations between markers and 
traits. We believe that our results may help microarray 
manufacturers and bioinformatics specialists to improve the 
design of array chips and the processing of their results to 
make GWAS more reliable. SNPs with flanking sequences 
that are not specific to a single genomic region should be 
replaced every time a SNP with specific flanking sequences 
exists within the same region (17,18). At minimum, microarray 
manufacturers should clearly mention this ambiguity 
in their annotation files. We suspect that the SNPs have 
not been updated since the completion of human genome 
sequencing (26), and that the selection of these misleading 
SNPs might have been favored by their high level of apparent 
heterozygosity. As sequence duplications are frequent in the 
genome, the risk of including a SNP with duplicated 
flanking sequences is high (17,18). Lastly, it might be that 
some of these duplications did not exist in the populations 
where the SNPs were first reported. The discrepancy we 
found between the Illumina and Affymetrix microarrays 
concerning the PTGER4 bias not only indicates that the 
bias 'can' be avoided, at least in some cases, but also 
stresses the difficulty of pooling data obtained from 
various microarray platforms. 

In the meantime, we recommend that the interpretation 
of previous and future GWAS be reconsidered in light of 
our findings. Errors in GWAS results caused by repetitive 
sequences can be avoided by several means. The usual 
exclusion tests (no-call SNPs, SNPs with low minor 
allele frequency, and SNPs deviating from Hardy-Weinberg 
equilibrium) are useful, but applying them 
more strictly might eliminate SNPs that are strongly 
influenced by natural selection (27). Instead, we suggest 
three steps that provide more confidence in the results 
without excluding SNPs from the analysis. First, stratification 
based on sex and on the platform used for 
genotyping should be performed systematically, even though it 
may reduce the statistical power of the analysis (27,28). 
Second, the specificity of all identified SNP sequences 
should be systematically checked. This should include a 
genome-wide alignment of the SNP-flanking sequences 
and of the restriction fragments. If significant homologies 
are found in other genomic regions, these should be con- 
sidered as susceptibility loci as well. However, we found 
that, in some cases, the effect of replicated sequences on 
SNP results is difficult to predict by sequence alignment, 
especially when the homologies are weak. Third, the full 
sequencing of the susceptibility loci associated with one 
SNP should help identify which of the replicated se- 
quences is truly associated with the phenotype. 
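
The first of these steps, stratification on sex, can be sketched as follows. This is a simplified illustration under stated assumptions, not the analysis pipeline of this study: it uses an allelic 2x2 Pearson chi-square test (1 degree of freedom, no continuity correction), treats the two alleles of a subject as independent, and the function names and toy counts are hypothetical. A pooled association that disappears within each sex stratum points to a sex-driven artifact rather than to the locus itself.

```python
import math

def chi2_2x2(a, b, c, d):
    """Pearson chi-square (1 df) for the table [[a, b], [c, d]]. Returns (chi2, p)."""
    n = a + b + c + d
    r1, r2, c1, c2 = a + b, c + d, a + c, b + d
    if min(r1, r2, c1, c2) == 0:
        return 0.0, 1.0  # test undefined: an empty row or column
    chi2 = n * (a * d - b * c) ** 2 / (r1 * r2 * c1 * c2)
    return chi2, math.erfc(math.sqrt(chi2 / 2))

def stratified_association(records):
    """records: iterable of (sex, is_case, n_alt), with sex in {'M', 'F'} and
    n_alt the number of alternate alleles called (0, 1 or 2).
    Returns allelic (chi2, p) for the pooled sample and for each sex."""
    # Each table is [case_alt, case_ref, control_alt, control_ref].
    tables = {"pooled": [0, 0, 0, 0], "M": [0, 0, 0, 0], "F": [0, 0, 0, 0]}
    for sex, is_case, n_alt in records:
        offset = 0 if is_case else 2
        for key in ("pooled", sex):
            tables[key][offset] += n_alt
            tables[key][offset + 1] += 2 - n_alt
    return {key: chi2_2x2(*table) for key, table in tables.items()}

# Toy cohort (hypothetical): calls depend only on sex (males appear
# heterozygous, females homozygous) and cases are mostly male.
records = (
    [("M", True, 1)] * 80 + [("F", True, 0)] * 20
    + [("M", False, 1)] * 20 + [("F", False, 0)] * 80
)
for stratum, (chi2, p) in stratified_association(records).items():
    print(stratum, round(chi2, 2), p)
```

With these toy counts the pooled test is highly significant, whereas neither sex stratum shows any association. Stratification on the genotyping platform could follow the same pattern with a platform label in place of sex.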



In sum, our findings underscore the need for a careful 
analysis of SNPs associated with a phenotype 
to identify misleading data devoid of any biological 
relevance. We would like to stress that the raw data from 
previously published studies should be available to the 
scientific community. In practice, external access to data 
for verification purposes is difficult, delayed and sometimes 
denied (although data are duly referenced in the 
dbGaP database) (29,30). We urge authors who have 
reported strong statistical associations of SNPs with 
diseases to perform a secondary analysis. Some SNPs 
that are not surrounded by any relevant gene with 
respect to a specific disease may have been selected 
because of their duplication on sex chromosomes or 
even on autosomes. 